[Day 10] Word Segmentation in R with quanteda and jiebaR - iT 邦幫忙

2023 iThome Ironman Contest (鐵人賽)

DAY 10

AI & Data

Part 10 of the series "Playing with Text Mining in R" (用R語言玩轉文字探勘)


Tags: 15th Ironman Contest, r, R language, text mining

rlover

2023-09-25 00:58:22


Word segmentation in R

Segmentation workflow with quanteda

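library(quanteda)
library(dplyr)  # provides the %>% pipe used below

# Chinese stopword list ("misc" source)
ch_stop <- quanteda::stopwords("zh", source = "misc")
# build a corpus from the preprocessed speech data frame (uses its `text` column);
# df_speech_pre is not defined in this post and is assumed to come from earlier in the series
corp <- corpus(df_speech_pre)
# tokenize, dropping punctuation and stopwords
ch_toks <- corp %>% 
    tokens(remove_punct = TRUE) %>%
    tokens_remove(pattern = ch_stop)

# construct a document-feature matrix and list its 10 most frequent features
ch_dfm <- dfm(ch_toks)
topfeatures(ch_dfm)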

##     我們     發展     民主     自由     社會       更 中華民國     國家     繁榮     成就 
##       11        7        6        5        5        5        4        4        4        3

Segmentation workflow with jiebaR

What if we switch to jiebaR, the segmenter most commonly used for Traditional Chinese corpora?

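library(jiebaR)
library(tidyverse)  # mutate(), str_*(), purrr::map(), unnest(), tibble() used below

### set up the segmenter
# POS-tagging worker with a Traditional Chinese stopword file
cutter <- worker("tag", stop_word = "data/停用詞-繁體中文.txt")
# register proper nouns so they are kept as single tokens
vector_word <- c("中華民國", "李登輝", "蔣中正", "蔣經國")
new_user_word(cutter, words = vector_word)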

## [1] TRUE

# reg_space <- "%E3%80%80" %>% curl::curl_escape()

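### normalize the text, segment it, and attach part-of-speech tags
df_speech_seg <-
  df_speech_pre %>% 
  # unify the two spellings of "Taiwan"
  mutate(text = str_replace_all(text, "台灣", "臺灣")) %>%
  # strip line breaks, tabs, colons, and half-width and full-width spaces
  mutate(text = str_remove_all(text, "\\n|\\r|\\t|:| |　")) %>%
  # mutate(text = str_remove_all(text, reg_space)) %>%
  # drop Latin letters and digits
  mutate(text = str_remove_all(text, "[a-zA-Z0-9]+")) %>%
  # with a "tag" worker, segment() returns the words, named by their POS tags
  mutate(text_segment = purrr::map(text, function(x) segment(x, cutter))) %>%
  mutate(text_POS = purrr::map(text_segment, function(x) names(x))) %>%
  unnest(c(text_segment, text_POS)) %>%
  # move the raw text to the last column
  select(-text, everything(), text)

# df_stop (a stopword table with a `stopword` column) is not defined in this post;
# it is assumed to come from earlier in the series
df_speech_seg %>% count(text_segment, sort = TRUE) %>%
  anti_join(df_stop, by = c("text_segment" = "stopword"))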

## # A tibble: 187 × 2
##    text_segment     n
##    <chr>        <int>
##  1 發展             7
##  2 國家             5
##  3 社會             5
##  4 中華民國         4
##  5 民主             4
##  6 繁榮             4
##  7 自由             4
##  8 兩岸             3
##  9 基礎             3
## 10 歷史             3
## # ℹ 177 more rows

Segmentation quality and limitations

Let's first look at how the actual segmentation results differ.

library(tidytext)  # unnest_tokens()

# top-10 terms from each tokenizer, side by side
tibble(rank = 1:10) %>%
  bind_cols(df_speech_seg %>% count(text_segment, sort = TRUE) %>%
  anti_join(df_stop, by = c("text_segment" = "stopword")) %>%
  select(jieba = 1) %>% head(10)) %>%
  bind_cols(df_speech_pre %>% select(text) %>% head(1) %>%
  unnest_tokens(output = word, input = text, token = "words") %>%
  count(word, sort = TRUE) %>%
  anti_join(df_stop, by = c("word" = "stopword")) %>%
  select(tidytext = 1) %>% head(10)) %>%
  bind_cols(tibble(quanteda = names(topfeatures(ch_dfm))))

## # A tibble: 10 × 4
##     rank jieba    tidytext quanteda
##    <int> <chr>    <chr>    <chr>   
##  1     1 發展     發展     我們    
##  2     2 國家     民主     發展    
##  3     3 社會     社會     民主    
##  4     4 中華民國 自由     自由    
##  5     5 民主     中華民國 社會    
##  6     6 繁榮     國家     更      
##  7     7 自由     繁榮     中華民國
##  8     8 兩岸     兩岸     國家    
##  9     9 基礎     基礎     繁榮    
## 10    10 歷史     成就     成就
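
To put a number on how much the three lists agree, the top-10 terms from the table above can be compared pairwise. This is only a small base-R sketch: the vectors are copied by hand from the printed table, and the names `top10`, `overlap`, and `only_in` are made up for illustration.

```r
# Top-10 terms copied from the comparison table above
top10 <- list(
  jieba    = c("發展", "國家", "社會", "中華民國", "民主", "繁榮", "自由", "兩岸", "基礎", "歷史"),
  tidytext = c("發展", "民主", "社會", "自由", "中華民國", "國家", "繁榮", "兩岸", "基礎", "成就"),
  quanteda = c("我們", "發展", "民主", "自由", "社會", "更", "中華民國", "國家", "繁榮", "成就")
)

# Number of shared terms for every pair of tokenizers
pairs <- combn(names(top10), 2, simplify = FALSE)
overlap <- vapply(
  pairs,
  function(p) length(intersect(top10[[p[1]]], top10[[p[2]]])),
  integer(1)
)
names(overlap) <- vapply(pairs, paste, character(1), collapse = " vs ")
print(overlap)

# Terms that appear in exactly one tokenizer's top 10
only_in <- lapply(names(top10), function(nm) {
  setdiff(top10[[nm]], unlist(top10[names(top10) != nm]))
})
names(only_in) <- names(top10)
print(only_in)
```

On this table, jieba and tidytext share 9 of 10 terms, jieba and quanteda share 7, and tidytext and quanteda share 8; 我們 and 更 show up only in the quanteda list.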

Several factors could explain these differences.

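The first, of course, is the segmentation engine, or tokenizer, itself. It has the largest effect on the result, because quality depends on how capable the engine is, which in turn depends on how the tokenizer was trained and on what data.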

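The second is the dictionary supplied before segmentation. In the jiebaR run above, for example, terms such as 中華民國 and 李登輝 were registered with new_user_word() so that they are kept as single tokens.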
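
To make the dictionary effect concrete, here is a toy forward-maximum-matching segmenter in base R. It is deliberately not the algorithm jiebaR uses (jiebaR's engines are statistical and more elaborate); the function `fmm_segment`, the two small dictionaries, and the sample sentence are all invented for this sketch.

```r
# Toy forward maximum matching: at each position take the longest dictionary
# entry that matches; fall back to a single character when nothing matches.
fmm_segment <- function(text, dict, max_len = 4L) {
  chars <- strsplit(text, "")[[1]]
  n <- length(chars)
  out <- character(0)
  i <- 1L
  while (i <= n) {
    # try candidate lengths from longest to shortest
    for (len in seq(min(max_len, n - i + 1L), 1L)) {
      cand <- paste(chars[i:(i + len - 1L)], collapse = "")
      if (len == 1L || cand %in% dict) {
        out <- c(out, cand)
        i <- i + len
        break
      }
    }
  }
  out
}

text <- "中華民國的民主發展"

# Hypothetical base dictionary without the full country name
base_dict <- c("中華", "民國", "民主", "發展")
# The same dictionary after adding a user-supplied term
user_dict <- c(base_dict, "中華民國")

seg_base <- fmm_segment(text, base_dict)
seg_user <- fmm_segment(text, user_dict)
print(seg_base)
print(seg_user)
```

With the base dictionary the sentence is split into 中華 / 民國 / 的 / 民主 / 發展; once 中華民國 is added it stays whole, so any frequency count built on the tokens changes accordingly.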

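The third is the stopword list. For jiebaR and tidytext I used the same stopword list that is common for Traditional Chinese corpora (for Simplified Chinese corpora, other resources are available). quanteda, however, comes with its own Chinese stopword list (loaded above with stopwords("zh", source = "misc")), so my list was not applied there. This is likely why 我們 and 更 appear only in the quanteda column.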
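
The effect of the stopword list alone can be shown with a fixed token stream. In this base-R sketch the tokens, both stopword lists, and the helper `top_terms` are hypothetical; they only mimic the situation where one list filters 我們 and 更 and the other does not.

```r
# Hypothetical, already-segmented token stream with distinct frequencies
tokens <- rep(c("我們", "發展", "更", "民主", "自由"), times = c(5, 4, 3, 2, 1))

# Two hypothetical stopword lists
stop_a <- c("的", "了")                  # leaves 我們 and 更 in place
stop_b <- c("的", "了", "我們", "更")    # filters them out

# Return the n most frequent terms after removing stopwords
top_terms <- function(tokens, stopwords, n = 3L) {
  kept <- tokens[!tokens %in% stopwords]
  freq <- sort(table(kept), decreasing = TRUE)
  names(freq)[seq_len(min(n, length(freq)))]
}

top_a <- top_terms(tokens, stop_a)
top_b <- top_terms(tokens, stop_b)
print(top_a)
print(top_b)
```

The same tokens yield 我們 / 發展 / 更 under the first list and 發展 / 民主 / 自由 under the second, so rankings from different packages are only comparable when the stopword list is held constant.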

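Other challenges include multilingual text, recognizing proper nouns, and handling slang (for example, PTT and Dcard each have their own culture).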